Abstract

Unsupervised Information Extraction (UIE) is the task of extracting
knowledge from text without using hand-tagged training examples. A
fundamental problem for both UIE and supervised IE is assessing the
probability that extracted information is correct. In massive corpora
such as the Web, the same extraction is found repeatedly in different
documents. How does this redundancy impact the probability of
correctness?

This paper introduces a combinatorial “balls-and-urns” model that
computes the impact of sample size, redundancy, and corroboration from
multiple distinct extraction rules on the probability that an
extraction is correct. We describe methods for estimating the model’s
parameters in practice and demonstrate experimentally that for UIE the
model’s log likelihoods are 15 times better, on average, than those
obtained by Pointwise Mutual Information (PMI) and the noisy-or model
used in previous work. For supervised IE, the model’s performance is
comparable to that of Support Vector Machines and Logistic Regression.

1  Introduction 

Information Extraction (IE) is the task of automatically extracting
knowledge from text. Unsupervised IE (UIE) is IE in the absence of
hand-tagged training data. Because UIE systems do not require human
intervention, they can recursively discover new relations, attributes,
and instances in a rapid, scalable manner, as in KnowItAll [Etzioni et
al., 2004; 2005].

A fundamental problem for both supervised IE and UIE is assessing the
probability that extracted information is correct. As explained in
Section 5, previous IE work has used a variety of techniques to address
this problem, but has yet to provide an adequate formal model of the
impact of redundancy (repeatedly obtaining the same extraction from
different documents) on the probability of correctness. Yet in massive
corpora such as the Web, redundancy is one of the main sources of
confidence in extractions.

An extraction that is obtained from multiple, distinct documents is
more likely to be a bona fide extraction than one obtained only once.
Because the documents that “support” the extraction are, by and large,
independently authored, our confidence in an extraction increases
dramatically with the number of supporting documents. But by how much?
How do we precisely quantify our confidence in an extraction given the
available textual evidence?

This paper introduces a combinatorial model that enables us to
determine the probability that an observed extraction is correct. We
validate the performance of the model empirically on the task of
extracting information from the Web using KnowItAll.

Our  contributions  are  as  follows: 

1.  A formal model that, unlike previous work, explicitly models the
impact of sample size, redundancy, and different extraction rules on
the probability that an extraction is correct. We analyze the
conditions under which the model is applicable, and provide intuitions
about its behavior in practice.

2.  Methods for estimating the model’s parameters in both the UIE and
supervised IE tasks.

3.  Experiments that demonstrate the model’s improved performance over
the techniques used to assess extraction probability in previous work.
For UIE, our model is a factor of 15 closer to the correct log
likelihood than the noisy-or model used in previous work; the model is
20 times closer than KnowItAll’s Pointwise Mutual Information (PMI)
method [Etzioni et al., 2004], which is based on Turney’s PMI-IR
algorithm [Turney, 2001]. For supervised IE, our model achieves a 19%
improvement in average log likelihood over the noisy-or model, but is
only marginally better than SVMs and logistic regression.

The  remainder  of  the  paper  is  organized  as  follows.  Section 
2  introduces  our  abstract  probabilistic  model,  and  Section  3 
describes  its  implementation  in  practice.  Section  4  reports 
on  experimental  results  in  four  domains.  Section  5  contrasts 
our  model  with  previous  work;  the  paper  concludes  with  a 
discussion  of  future  work. 

2  The  Urns  Model 

Our  probabilistic  model  takes  the  form  of  a  classic  “balls- 
and-urns”  model  from  combinatorics.  We  first  consider  the 
single urn case, for simplicity, and then generalize to the full
multiple  Urns  Model  used  in  our  experiments.  We  refer  to  the 
model  simply  as  URNS. 

We think of IE abstractly as a generative process that maps text to
extractions. Extractions repeat because distinct documents may yield
the same extraction. For example, the Web page containing “Scenic towns
such as Yakima...” and the Web page containing “Washington towns such
as Yakima...” both lead us to believe that Yakima is a correct
extraction of the relation City(x).

Each  extraction  is  modeled  as  a  labeled  ball  in  an  urn.  A 
label  represents  either  an  instance  of  the  target  relation,  or 
an  error.  The  information  extraction  process  is  modeled  as 
repeated  draws  from  the  urn,  with  replacement.  Thus,  in  the 
above  example,  two  balls  are  drawn  from  the  urn,  each  with 
the label “Yakima”. The labels are instances of the relation
City(x). Each label may appear on a different number of
balls  in  the  urn.  Finally,  there  may  be  balls  in  the  urn  with 
error  labels  such  as  “California”,  representing  cases  where 
the  IE  process  generated  an  extraction  that  is  not  a  member 
of  the  target  relation. 

Formally,  the  parameters  that  characterize  an  urn  are: 

•  $C$ - the set of unique target labels; $|C|$ is the number of
unique target labels in the urn.

•  $E$ - the set of unique error labels; $|E|$ is the number of
unique error labels in the urn.

•  $\mathrm{num}(b)$ - the function giving the number of balls labeled
by $b$, where $b \in C \cup E$. $\mathrm{num}(B)$ is the multi-set
giving the number of balls for each label $b \in B$.

Of course, IE systems do not have access to these parameters directly.
The goal of an IE system is to discern which of the labels it extracts
are in fact elements of $C$, based on repeated draws from the urn.
Thus, the central question we are investigating is: given that a
particular label $x$ was extracted $k$ times in a set of $n$ draws from
the urn, what is the probability that $x \in C$?

In deriving this probability formally below, we assume the IE system
has access to multi-sets $\mathrm{num}(C)$ and $\mathrm{num}(E)$ giving
the number of times the labels in $C$ and $E$ appear on balls in the
urn. In our experiments, we provide methods that estimate these
multi-sets in both unsupervised and supervised settings. We can express
the probability that an element extracted $k$ of $n$ times is of the
target relation as follows:

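First, we have that

$$P(x \text{ appears } k \text{ times in } n \text{ draws} \mid x \in C) = \sum_{r} \binom{n}{k} \left(\frac{r}{s}\right)^{k} \left(1 - \frac{r}{s}\right)^{n-k} P(\mathrm{num}(x) = r \mid x \in C)$$

where $s$ is the total number of balls in the urn, and the sum is
taken over possible repetition rates $r$.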

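Then we can express the desired quantity using Bayes’ rule:

$$P(x \in C \mid x \text{ appears } k \text{ times in } n \text{ draws}) = \frac{P(x \text{ appears } k \text{ times in } n \text{ draws} \mid x \in C)\, P(x \in C)}{P(x \text{ appears } k \text{ times in } n \text{ draws})} \qquad (1)$$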


Note that these expressions include prior information about the label
$x$; for example, $P(x \in C)$ is the prior probability that the string
$x$ is a target label, and $P(\mathrm{num}(x) = r \mid x \in C)$
represents the probability that a target label $x$ is repeated on $r$
balls in the urn. In general, integrating this prior information could
be valuable for IE systems; however, in the analysis and experiments
that follow, we make the simplifying assumption of uniform priors,
yielding the following simplified form:

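Proposition 1

$$P(x \in C \mid x \text{ appears } k \text{ times in } n \text{ draws}) = \frac{\sum_{r \in \mathrm{num}(C)} \left(\frac{r}{s}\right)^{k} \left(1 - \frac{r}{s}\right)^{n-k}}{\sum_{r' \in \mathrm{num}(C \cup E)} \left(\frac{r'}{s}\right)^{k} \left(1 - \frac{r'}{s}\right)^{n-k}}$$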
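Proposition 1 can be evaluated directly when the multi-sets are known. The following Python sketch is only an illustration: the function name `urn_posterior` and the representation of num(C) and num(E) as plain lists of ball counts are assumptions made here, not part of the paper.

```python
from collections import Counter


def urn_posterior(k, n, num_c, num_e):
    """P(x in C | x appears k times in n draws), uniform priors (Proposition 1).

    num_c and num_e are hypothetical inputs: lists holding, for each unique
    target or error label, the number of balls carrying that label.
    """
    s = sum(num_c) + sum(num_e)  # total number of balls in the urn

    def likelihood_mass(counts):
        # Sum of (r/s)^k (1 - r/s)^(n-k) over the multi-set. The binomial
        # coefficient C(n, k) is omitted because it cancels in the ratio.
        total = 0.0
        for r, multiplicity in Counter(counts).items():
            total += multiplicity * (r / s) ** k * (1 - r / s) ** (n - k)
        return total

    target = likelihood_mass(num_c)
    error = likelihood_mass(num_e)
    return target / (target + error)


# Uniform urn: 2,000 target labels on 9 balls each, 2,000 error labels on 1 ball each.
print(round(urn_posterior(3, 10_000, [9] * 2000, [1] * 2000), 3))
```

For very large n or k the individual terms can underflow to zero; a log-space variant would be needed in that regime.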

2.1  The  Uniform  Special  Case 

For illustration, consider the simple case in which all labels from $C$
are repeated on the same number of balls. That is,
$\mathrm{num}(c_i) = R_C$ for all $c_i \in C$, and assume also that
$\mathrm{num}(e_i) = R_E$ for all $e_i \in E$. While these assumptions
are unrealistic (in fact, we use a Zipf distribution for
$\mathrm{num}(b)$ in our experiments), they are a reasonable
approximation for the majority of labels, which lie on the flat tail of
the Zipf curve.

Define $p$ to be the precision of the extraction process; that is, the
probability that a given draw comes from the target relation. In the
uniform case, we have:

$$p = \frac{|C| R_C}{|E| R_E + |C| R_C}$$

The probability that a particular element of $C$ appears in a given
draw is then $p_C = \frac{p}{|C|}$, and similarly
$p_E = \frac{1-p}{|E|}$.

Using a Poisson model to approximate the binomial from Proposition 1,
we have:

$$P(x \in C \mid x \text{ appears } k \text{ times in } n \text{ draws}) \approx \frac{1}{1 + \frac{|E|}{|C|} \left(\frac{p_E}{p_C}\right)^{k} e^{n(p_C - p_E)}} \qquad (2)$$

In practice, the extraction process is noisy but informative, so
$p_C > p_E$. Notice that when this is true, Equation (2) shows that the
odds that $x \in C$ increase exponentially with the number of times $k$
that $x$ is extracted, but also decrease exponentially with the sample
size $n$.

A few numerical examples illustrate the behavior of this equation. The
examples assume that the precision $p$ is 0.9. Let
$|C| = |E| = 2{,}000$. This means that $R_C = 9 \times R_E$: target
balls are nine times as common in the urn as error balls. Now, for
$k = 3$ and $n = 10{,}000$ we have $P(x \in C) = 93.0\%$. Thus, we see
that a small number of repetitions can yield high confidence in an
extraction. However, when the sample size increases so that
$n = 20{,}000$, and the other parameters are unchanged, then
$P(x \in C)$ drops to 19.6%. On the other hand, if $C$ balls repeat
much more frequently than $E$ balls, say $R_C = 90 \times R_E$ (with
$|E|$ set to 20,000, so that $p$ remains unchanged), then
$P(x \in C)$ rises to 99.9%.
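The three figures quoted above can be checked numerically. The sketch below assumes the Poisson approximation of Proposition 1 in the uniform case, written as $1 / (1 + \frac{|E|}{|C|} (p_E/p_C)^k e^{n(p_C - p_E)})$; the function name and argument names are illustrative choices, not the paper's.

```python
import math


def uniform_urn_posterior(k, n, p, size_c, size_e):
    """Poisson approximation of P(x in C | k appearances in n draws), uniform case."""
    p_c = p / size_c        # chance that a given draw yields one particular target label
    p_e = (1 - p) / size_e  # chance that a given draw yields one particular error label
    # Odds that x is an error rather than a target label.
    odds_against = (size_e / size_c) * (p_e / p_c) ** k * math.exp(n * (p_c - p_e))
    return 1 / (1 + odds_against)


baseline = uniform_urn_posterior(3, 10_000, 0.9, 2_000, 2_000)
larger_sample = uniform_urn_posterior(3, 20_000, 0.9, 2_000, 2_000)
rarer_errors = uniform_urn_posterior(3, 10_000, 0.9, 2_000, 20_000)
print(f"{baseline:.1%} {larger_sample:.1%} {rarer_errors:.1%}")
```

Note that `math.exp` overflows for exponents above roughly 709, so a production version would work with log-odds instead.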

The above examples enable us to illustrate the advantages of URNS over
the noisy-or model used in previous work [Lin et al., 2003; Agichtein
and Gravano, 2000]. The noisy-or model assumes that each extraction is
an independent assertion, correct a fraction $p$ of the time, that the
extracted label is “true.” The noisy-or model assigns the following
probability to extractions:

$$P_{\text{noisy-or}}(x \in C \mid x \text{ appears } k \text{ times}) = 1 - (1 - p)^{k}$$

Therefore, the noisy-or model will assign the same probability, 99.9%,
in all three of the above examples. Yet, as explained above, 99.9% is
only correct in the case for which $n = 10{,}000$ and
$R_C = 90 \times R_E$. As the other two examples show, for different
sample sizes or repetition rates, the noisy-or model can be highly
inaccurate. This is not surprising given that the noisy-or model
ignores the sample size and the repetition rates. Section 4 quantifies
the improvements obtained by URNS in practice.
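A short sketch makes the contrast concrete: the noisy-or estimate depends only on $p$ and $k$, so it cannot react to the sample size or to the repetition rates. The function name `noisy_or` is an illustrative choice.

```python
def noisy_or(p, k):
    """Noisy-or probability that a label extracted k times is correct."""
    return 1 - (1 - p) ** k


# The same value results for every sample size n and every ratio R_C / R_E,
# because neither quantity is an input to the formula.
print(f"{noisy_or(0.9, 3):.1%}")
```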


2.2  Applicability  of  the  Urns  Model 

Under what conditions does our redundancy model provide accurate
probability estimates? First, labels from the target set $C$ must be
repeated on more balls in the urn than labels from the $E$ set, as in
Figure 1. The shaded region in Figure 1 represents the “confusion
region”: some of the labels in this region will be classified
incorrectly, even by the ideal classifier with infinite data, because
for these labels there simply isn’t enough information to decide
whether they belong to $C$ or $E$. Thus, our model is effective when
the confusion region is relatively small. Secondly, even for small
confusion regions, the sample size $n$ must be large enough to
approximate the two distributions shown in Figure 1; otherwise the
probabilities output by the model will be inaccurate.

An attractive feature of URNS is that it enables us to estimate its
expected recall and precision as a function of sample size. If the
distributions in Figure 1 cross at the dotted line shown then, given a
sufficiently large sample size $n$, expected recall will be the
fraction of the area under the $C$ curve lying to the right of the
dotted line.

For a given sample size $n$, define $\tau_n$ to be the least number of
appearances $k$ at which an extraction is more likely to be from the
$C$ set than the $E$ set (given the distributions in Figure 1, $\tau_n$
can be computed using Proposition 1). Then we have:

$$E[\mathit{TruePositives}] = |C| - \sum_{r \in \mathrm{num}(C)} \sum_{k=0}^{\tau_n - 1} \binom{n}{k} \left(\frac{r}{s}\right)^{k} \left(1 - \frac{r}{s}\right)^{n-k}$$

where we define “true positives” to be the number of extracted labels
$c_i \in C$ for which the model assigns probability
$P(c_i \in C) > 0.5$.

The expected number of false positives is similarly:

$$E[\mathit{FalsePositives}] = |E| - \sum_{r \in \mathrm{num}(E)} \sum_{k=0}^{\tau_n - 1} \binom{n}{k} \left(\frac{r}{s}\right)^{k} \left(1 - \frac{r}{s}\right)^{n-k}$$

The expected precision of the system can then be approximated as:

$$E[\mathit{Precision}] \approx \frac{E[\mathit{TruePositives}]}{E[\mathit{FalsePositives}] + E[\mathit{TruePositives}]}$$


Figure 1: Schematic illustration of the number of distinct labels in
the $C$ and $E$ sets with repetition rate $r$. The “confusion region”
is shaded.

For example, consider the particular $\mathrm{num}(C)$ and
$\mathrm{num}(E)$ learned (in the unsupervised setting) for the Film
relation in our experiments. For the sample size $n = 134{,}912$ used
in the experiments, the expected number of true positives is 26,133 and
the expected precision is 70.2%, which is close to the actual observed
true positives of 23,408 and precision of 67.7%. Were we to increase
the sample size to 1,000,000, we would expect that true positives would
increase to 47,609, and precision to 84.0%. Thus, URNS and the above
equations enable an IE system to intelligently choose its sample size
depending on precision and recall requirements and resource
constraints, even in the absence of tagged training data.
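The expected true positives, false positives, and precision can be sketched in a few lines once the multi-sets are given. This is an illustration under stated assumptions: the multi-sets are passed as plain lists of ball counts, the threshold search starts at $k = 1$ (a label must be extracted at least once), and the uniform urn in the usage line is a toy input, not the Film data from the experiments.

```python
import math
from collections import Counter


def binom_pmf(k, n, q):
    """Binomial probability of k successes in n draws, computed in log space."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(q) + (n - k) * math.log1p(-q))
    return math.exp(log_pmf)


def expected_counts(n, num_c, num_e):
    """Return (tau_n, E[true positives], E[false positives], E[precision])."""
    s = sum(num_c) + sum(num_e)  # total number of balls in the urn
    c_counts, e_counts = Counter(num_c), Counter(num_e)

    def mass(counts, k):
        # Expected number of labels in this multi-set that appear exactly k times.
        return sum(m * binom_pmf(k, n, r / s) for r, m in counts.items())

    # tau_n: least k >= 1 at which a label is more likely to come from C than E.
    # Raises StopIteration if no such k exists.
    tau = next(k for k in range(1, n + 1) if mass(c_counts, k) > mass(e_counts, k))

    def expected_at_or_above(counts):
        below = sum(mass(counts, k) for k in range(tau))
        return sum(counts.values()) - below

    tp = expected_at_or_above(c_counts)
    fp = expected_at_or_above(e_counts)
    return tau, tp, fp, tp / (tp + fp)


tau, tp, fp, precision = expected_counts(10_000, [9] * 2000, [1] * 2000)
print(tau, round(tp), round(fp), round(precision, 3))
```

Sweeping `n` over a range of values with such a function is one way to see how precision and recall trade off against sample size for a fixed urn.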

2.3  Multiple  Urns 

We now generalize our model to encompass multiple urns. Information is
often extracted using multiple, distinct mechanisms; for example, an IE
system might employ several patterns for extracting city names, e.g.
“cities including x” and “x and other towns.” It is often the case that
different patterns have different modes of failure, so extractions
appearing across multiple patterns are generally more likely to be true
than those appearing for a single pattern. We can model this situation
by introducing multiple urns where each urn represents a different
extraction mechanism.1

Thus, instead of $n$ total extractions, we have a sample size $n_m$ for
each urn $m \in M$, with the extraction $x$ appearing $k_m$ times. Let
$A(x, (k_1, \ldots, k_m), (n_1, \ldots, n_m))$ denote this event.
Further, let $A_m(x, k, n)$ be the event that label $x$ appears $k$
times in $n$ draws from urn $m$. Assuming that the draws from each urn
are independent, we have:

Proposition 2

$$P(x \in C \mid A(x, (k_1, \ldots, k_m), (n_1, \ldots, n_m))) = \frac{\sum_{c_i \in C} \prod_{m \in M} P(A_m(c_i, k_m, n_m))}{\sum_{x \in C \cup E} \prod_{m \in M} P(A_m(x, k_m, n_m))}$$

With multiple urns, the distributions of labels among balls in the urns
are represented by multi-sets $\mathrm{num}_m(C)$ and
$\mathrm{num}_m(E)$. Expressing the correlation between
$\mathrm{num}_m(x)$ and $\mathrm{num}_{m'}(x)$ is an important modeling
decision. Multiple urns are especially beneficial when the repetition
rates for elements of $C$ are more strongly correlated across different
urns than they are for elements of $E$; that is, when
$\mathrm{num}_m(x)$ and $\mathrm{num}_{m'}(x)$ tend to be closer to
each other for $x \in C$ than for $x \in E$. Fortunately, this turns
out to be the case in practice. Section 3 describes our method for
modeling multi-urn correlation.

1 We may lump several mechanisms into a single urn if they tend to
behave similarly.


3  Implementation  of  the  Urns  Model 


This section describes how we implement URNS for both UIE and
supervised IE, and identifies the assumptions made in each case.

In order to compute probabilities for extractions, we need a method for
estimating $\mathrm{num}(C)$ and $\mathrm{num}(E)$. For the purpose of
estimating these sets from tagged or untagged data, we assume that
$\mathrm{num}(C)$ and $\mathrm{num}(E)$ are Zipf distributed, meaning
that if $c_i$ is the $i$th most frequently repeated label in $C$, then
$\mathrm{num}(c_i)$ is proportional to $i^{-z_C}$. We can then
characterize the $\mathrm{num}(C)$ and $\mathrm{num}(E)$ sets with five
parameters: the set sizes $|C|$ and $|E|$, the shape parameters $z_C$
and $z_E$, and the extraction precision $p$.

To model multiple urns, we consider different extraction precisions
$p_m$ for each urn, but make the simplifying assumption that the size
and shape parameters are the same for all urns. As mentioned in Section
2, we expect repetition rate correlation across urns to be higher for
elements of the $C$ set than for the $E$ set. We model this correlation
as follows: first, elements of the $C$ set are assumed to come from the
same location on the Zipf curve for all urns, that is, their relative
frequencies are perfectly correlated. Some elements of the $E$ set are
similar, and have the same relative frequency across urns; these are
the systematic errors. However, the rest of the $E$ set is made up of
non-systematic errors, meaning that they appear for only one kind of
extraction mechanism (for example, “Eastman Kodak” is extracted as an
instance of Film only in phrases involving the word “film”, and not in
those involving the word “movie”). Formally, non-systematic errors are
labels that are present in some urns and not in others. Each type of
non-systematic error makes up some fraction of the $E$ set, and these
fractions are the parameters of our correlation model. Assuming this
simple correlation model and identical size and shape parameters across
urns is too restrictive in general; differences between extraction
mechanisms are often more complex. However, our assumptions allow us to
compute probabilities efficiently (as described below) and do not
appear to hurt performance significantly in practice.

With this correlation model, if a label $x$ is an element of $C$ or a
systematic error, it will be present in all urns. In terms of
Proposition 2, the probability that a label $x$ appears $k_m$ times in
$n_m$ draws from urn $m$ is:

$$P(A_m(x, k_m, n_m)) = \binom{n_m}{k_m} (f_m(x))^{k_m} (1 - f_m(x))^{n_m - k_m} \qquad (3)$$

where $f_m(x)$ is the frequency of extraction $x$. That is,

$$f_m(c_i) = p_m Q_C\, i^{-z_C} \quad \text{for } c_i \in C$$

$$f_m(e_i) = (1 - p_m) Q_E\, i^{-z_E} \quad \text{for } e_i \in E$$

In these expressions, $i$ is the frequency rank of the extraction,
assumed to be the same across all urns, and $Q_C$ and $Q_E$ are
normalizing constants such that

$$\sum_{c_i \in C} Q_C\, i^{-z_C} = \sum_{e_i \in E} Q_E\, i^{-z_E} = 1$$

For a non-systematic error $x$ which is not present in urn $m$,
$P(A_m(x, k_m, n_m))$ is 1 if $k_m = 0$ and 0 otherwise. Substituting
these expressions for $P(A_m(x, k_m, n_m))$ into Proposition 2 gives
the final form of our URNS model.
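The substitution described above can be sketched by brute-force summation over ranks. This is a simplified illustration, not the implementation used in the experiments: it treats every error label as systematic (present in all urns), it sums over all labels explicitly rather than using a closed form, and all function and parameter names are assumptions made here.

```python
import math


def zipf_frequencies(size, z):
    """Normalized Zipf weights Q * i^(-z) for ranks i = 1..size (they sum to 1)."""
    weights = [i ** -z for i in range(1, size + 1)]
    q = 1 / sum(weights)  # normalizing constant Q
    return [q * w for w in weights]


def log_binom(k, n, f):
    """Log of the binomial probability of k appearances in n draws at frequency f."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(f) + (n - k) * math.log1p(-f))


def multi_urn_posterior(ks, ns, precisions, size_c, size_e, z_c, z_e):
    """P(x in C | k_m appearances in n_m draws from each urn m), Zipf-distributed urns."""
    base_c = zipf_frequencies(size_c, z_c)
    base_e = zipf_frequencies(size_e, z_e)

    def mass(base, scale_for):
        total = 0.0
        for relative_frequency in base:
            # Product over urns of P(A_m(x, k_m, n_m)), accumulated in log space.
            log_product = sum(
                log_binom(k, n, scale_for(p) * relative_frequency)
                for k, n, p in zip(ks, ns, precisions)
            )
            total += math.exp(log_product)
        return total

    target = mass(base_c, lambda p: p)      # f_m(c_i) = p_m * Q_C * i^(-z_C)
    error = mass(base_e, lambda p: 1 - p)   # f_m(e_i) = (1 - p_m) * Q_E * i^(-z_E)
    return target / (target + error)


# With z = 0 the Zipf curve is flat, which recovers the uniform special case.
single = multi_urn_posterior([3], [10_000], [0.9], 2_000, 2_000, 0.0, 0.0)
double = multi_urn_posterior([3, 3], [10_000, 10_000], [0.9, 0.9], 2_000, 2_000, 0.0, 0.0)
print(round(single, 3), round(double, 3))
```

In the usage lines, corroboration from a second urn raises the posterior above the single-urn value, which matches the intuition behind the multiple-urn model.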

3.1  Efficient  Computation 

A feature of our implementation is that it allows for efficient
computation of probabilities. In general, computing the sum in
Proposition 2 over the potentially large $C$ and $E$ sets would require
significant computation for each extraction. However, given a fixed
number of urns, with $\mathrm{num}(C)$ and $\mathrm{num}(E)$ Zipf
distributed, an integral approximation to the sum in Proposition 2
(using a Poisson in place of the binomial in Equation 3) can be solved
in closed form in terms of incomplete Gamma functions. This closed form
expression can be evaluated quickly, and thus probabilities for
extractions can be obtained efficiently. This solution leverages our
assumptions that size and shape parameters are identical across urns,
and that relative frequencies are perfectly correlated. Finding
efficient techniques for computing probabilities under less stringent
assumptions is an item of future work.


3.2  Parameter  Estimation 

In the event that a large sample of hand-tagged training examples is
available for each target relation of interest, we can directly
estimate each of the parameters of URNS. We use a population-based
stochastic optimization technique to identify parameter settings that
maximize the conditional log likelihood of the training data.2 Once the
parameters are set, the model yields a probability for each extraction,
given the number of times $k_m$ it appears in each urn and the number
of draws $n_m$ from each urn.

As argued in [Etzioni et al., 2005], IE systems cannot rely on
hand-tagged training examples if they are to scale to extracting
information on arbitrary relations that are not specified in advance.
Implementing URNS for UIE requires a solution to the challenging
problem of estimating $\mathrm{num}(C)$ and $\mathrm{num}(E)$ using
untagged data. Let $U$ be the multi-set consisting of the number of
times each unique label was extracted; $|U|$ is the number of unique
labels encountered, and the sample size $n = \sum_{u \in U} u$.

In  order  to  learn  num(C)  and  num(E)  from  untagged 
data,  we  make  the  following  assumptions: 

•  Because the number of different possible errors is nearly
unbounded, we assume that the error set is very large.3

•  We assume that both $\mathrm{num}(C)$ and $\mathrm{num}(E)$ are
Zipf distributed where the $z_E$ parameter is set to 1.

•  In our experience with KnowItAll, we found that while different
extraction rules have differing precision, each rule’s precision is
stable across different relations [Etzioni et al., 2005]. URNS takes
this precision as an input. To demonstrate that URNS is not overly
sensitive to this parameter, we chose a fixed value (0.9) and used it
as the precision $p_m$ for all urns in our experiments.4

2 Specifically, we use the Differential Evolution routine built into
Mathematica 5.0.

3 In our experiments, we set $|E| = 10^6$. A sensitivity analysis
showed that changing $|E|$ by an order of magnitude, in either
direction, resulted in only small changes to our results.

We then use Expectation Maximization (EM) over $U$ in order to arrive
at appropriate values for $|C|$ and $z_C$ (these two quantities
uniquely determine $\mathrm{num}(C)$ given our assumptions). Our EM
algorithm proceeds as follows:

1.  Initialize $|C|$ and $z_C$ to starting values.

2.  Repeat until convergence:

(a)  E-step: Assign probabilities to each element of $U$ using
Proposition 1.

(b)  M-step: Set $|C|$ and $z_C$ from $U$ using the probabilities
assigned in the E-step (details below).

We obtain $|C|$ and $z_C$ in the M-step by first estimating the
rank-frequency distribution for labels from $C$ in the untagged data.
From the untagged data and the probabilities found in the E-step, we
can obtain $E_C[k]$, the expected number of labels from $C$ that were
extracted $k$ times. We then round these fractional expected counts
into a discrete rank-frequency distribution with a number of elements
equal to the expected total number of labels from $C$ in the untagged
data, $\sum_k E_C[k]$. We obtain $z_C$ by fitting a Zipf curve to this
rank-frequency distribution by linear regression on a log-log scale.
Lastly, we set $|C| = \sum_k E_C[k] + \mathit{unseen}$, where we
estimate the number of unseen labels of the $C$ set using Good-Turing
estimation [Gale and Sampson, 1995]. Specifically, we choose
$\mathit{unseen}$ such that the probability mass of unseen labels is
equal to the expected fraction of the draws from $C$ that extracted
labels seen only once.
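The Zipf-fitting part of the M-step can be illustrated with an ordinary least-squares fit on a log-log scale. This sketch covers only that one step, under assumptions made here: the input is a list of positive frequencies, the fit is unweighted, and the function name is hypothetical. The rounding of expected counts and the Good-Turing estimate are not shown.

```python
import math


def fit_zipf_exponent(frequencies):
    """Estimate the Zipf shape parameter z from positive frequencies.

    Fits log(frequency) = a - z * log(rank) by ordinary least squares and
    returns z. Requires at least two frequencies, all greater than zero.
    """
    ranked = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(freq) for freq in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return -slope  # the slope of a Zipf curve on a log-log plot is -z


# Synthetic check: frequencies generated with z = 1.2 should give back about 1.2.
synthetic = [1000.0 * rank ** -1.2 for rank in range(1, 51)]
print(round(fit_zipf_exponent(synthetic), 3))
```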

This  unsupervised  learning  strategy  proved  effective  for 
target  relations  of  different  sizes;  for  example,  the  number 
of  elements  of  the  Country  relation  with  non-negligible 
extraction  probability  was  about  two  orders  of  magnitude 
smaller  than  that  of  the  Film  and  City  relations. 

Clearly, unsupervised learning relies on several strong assumptions,
though our sensitivity analysis has shown that the model’s performance
is robust to some of them. In future work, we plan to perform a more
comprehensive sensitivity analysis of the model and also investigate
its performance in a semi-supervised setting.

4  Experimental  Results 

This section describes our experimental results under two settings:
unsupervised and supervised. We begin by describing the two
unsupervised methods used in previous work: the noisy-or model and PMI.
We then compare URNS with these methods experimentally, and lastly
compare URNS with several baseline methods in a supervised setting.

4 A sensitivity analysis showed that choosing a substantially higher
(0.95) or lower (0.80) value for $p_m$ still resulted in URNS
outperforming the noisy-or model by at least a factor of 8 and PMI by
at least a factor of 10 in the experiments described in Section 4.1.

We evaluated our algorithms on extraction sets for the relations
City(x), Film(x), Country(x), and MayorOf(x, y), taken from experiments
performed in [Etzioni et al., 2005]. The sample size $n$ was 64,581 for
City, 134,912 for Film, 51,313 for Country and 46,129 for MayorOf. The
extraction patterns were partitioned into urns based on the name they
employed for their target relation (e.g. “country” or “nation”) and
whether they were left-handed (e.g. “countries including x”) or
right-handed (e.g. “x and other countries”). Each combination of
relation name and handedness was treated as a separate urn, resulting
in four urns for each of City(x), Film(x), and Country(x), and two urns
for MayorOf(x, y).5 For each relation, we tagged a sample of 1000
extracted labels, using external knowledge bases (the Tipster Gazetteer
for cities and the Internet Movie Database for films) and manually
tagging those instances not found in a knowledge base. In the UIE
experiments, we evaluate our algorithms on all 1000 examples, and in
the supervised IE experiments we perform 10-fold cross validation.

4.1  UIE  Experiments 

We compare URNS against two other methods for unsupervised information
extraction. First, in the noisy-or model used in previous work, an
extraction appearing $k_m$ times in each urn $m$ is assigned
probability $1 - \prod_{m \in M} (1 - p_m)^{k_m}$, where $p_m$ is the
extraction precision for urn $m$. We describe the second method below.

Pointwise  Mutual  Information 

Our previous work on KnowItAll used Pointwise Mutual Information (PMI)
to obtain probability estimates for extractions [Etzioni et al., 2005].
Specifically, the PMI between an extraction and a set of automatically
generated discriminator phrases (e.g., “movies such as x”) is computed
from Web search engine hit counts. These PMI scores are used as
features in a Naive Bayes Classifier (NBC) to produce a probability
estimate for the extraction. The NBC is trained using a set of
automatically bootstrapped seed instances. The positive seed instances
are taken to be those having the highest PMI with the discriminator
phrases after the bootstrapping process; the negative seeds are taken
from the positive seeds of other relations, as in other work (e.g.,
[Lin et al., 2003]).

Although PMI was shown in [Etzioni et al., 2005] to order extractions
fairly well, it has two significant shortcomings. First, obtaining the
hit counts needed to compute the PMI scores is expensive, as it
requires a large number of queries to web search engines. Second, the
seeds produced by the bootstrapping process tend not to be
representative of the overall distribution of extractions. This,
combined with the probability polarization introduced by the NBC, tends
to give inaccurate probability estimates.

5 Draws from URNS are intended to represent independent extractions.
Because the same sentence can be duplicated across multiple different
Web documents, in these experiments we consider only each unique
sentence containing an extraction to be a draw from URNS. In
experiments with other possibilities, including counting the number of
unique documents producing each extraction, or simply counting every
occurrence of each extraction, we found that performance differences
between the various approaches were negligible for our task.

Discussion  of  UIE  Results 

The results of our unsupervised experiments are shown in Figure 2. We
plot deviation from the ideal log likelihood, defined as the maximum
achievable log likelihood given our feature set.

Our experimental results demonstrate that URNS overcomes the weaknesses
of PMI. First, URNS’s probabilities are far more accurate than PMI’s,
achieving a log likelihood that is a factor of 20 closer to the ideal,
on average (Figure 2). Second, URNS is substantially more efficient, as
shown in Table 1.

This efficiency gain requires some explanation. KnowItAll
relies on queries to Web search engines to identify
Web pages containing potential extractions. The number of
queries KnowItAll can issue daily is limited, and querying
over the Web is, by far, KnowItAll's most expensive
operation. Thus, the number of search engine queries is our
efficiency metric. Let d be the number of discriminator phrases
used by the PMI method as explained in Section 4.1. The
PMI method requires O(d) search engine queries to compute
the PMI of each extraction from search engine hit counts. In
contrast, URNS computes probabilities directly from the set
of extractions and requires no additional queries, which cuts
KnowItAll's queries by factors ranging from 1.9 to 17.
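
The query accounting above can be sketched as follows. This is only an
illustration under simplifying assumptions that are not in the paper: PMI is
taken to cost exactly d queries per distinct label (the text says only O(d)),
and the function names and example numbers are hypothetical.

```python
def pmi_query_count(extraction_queries: int, distinct_labels: int, d: int) -> int:
    """Queries for extraction plus d hit-count queries per distinct label (assumed cost model)."""
    return extraction_queries + d * distinct_labels


def urns_query_count(extraction_queries: int) -> int:
    """URNS scores labels from the extraction counts alone, so no extra queries are issued."""
    return extraction_queries


def speedup(extraction_queries: int, distinct_labels: int, d: int) -> float:
    """Ratio of queries with PMI to queries with URNS, as in the top row of Table 1."""
    return pmi_query_count(extraction_queries, distinct_labels, d) / urns_query_count(
        extraction_queries
    )


if __name__ == "__main__":
    # Hypothetical numbers: the same extraction effort yields many or few distinct labels.
    print(speedup(extraction_queries=1000, distinct_labels=500, d=5))
    print(speedup(extraction_queries=1000, distinct_labels=100, d=5))
```

Under this cost model, a relation whose labels are each drawn many times yields
few distinct labels per extraction query, so the PMI overhead and hence the
speedup are smaller.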

As  explained  in  Section  2,  the  noisy-or  model  ignores  tar¬ 
get  set  size  and  sample  size,  which  leads  it  to  assign  proba¬ 
bilities  that  are  far  too  high  for  the  Country  and  MayorOf 
relations,  where  the  average  number  of  times  each  label  is 
extracted  is  high  (see  bottom  row  of  Table  1).  This  is  fur¬ 
ther  illustrated  for  the  Country  relation  in  Figure  3.  The 
noisy-or  model  assigns  appropriate  probabilities  for  low  sam¬ 
ple  sizes,  because  in  this  case  the  overall  precision  of  ex¬ 
tracted  labels  is  in  fact  fairly  high,  as  predicted  by  the  noisy-or 
model.  However,  as  sample  size  increases  relative  to  the  num¬ 
ber  of  true  countries,  the  overall  precision  of  the  extracted  la¬ 
bels  decreases — and  the  noisy-or  estimate  worsens.  On  the 
other  hand,  URNS  avoids  this  problem  by  accounting  for  the 
interaction  between  target  set  size  and  sample  size,  adjust¬ 
ing  its  probability  estimates  as  sample  size  increases.  Given 
sufficient  sample  size,  URNS  performs  close  to  the  ideal  log 
likelihood,  improving  slightly  with  more  samples  as  the  es¬ 
timates  obtained  by  the  EM  process  become  more  accurate. 
Overall,  URNS  assigns  far  more  accurate  probabilities  than 
the  noisy-or  model,  and  its  log  likelihood  is  a  factor  of  15 
closer  to  the  ideal,  on  average.  The  very  large  differences  be¬ 
tween  URNS  and  both  the  noisy-or  model  and  PMI  suggest 
that,  even  if  the  performance  of  URNS  degrades  in  other  do¬ 
mains,  it  is  quite  likely  to  still  outperform  both  PMI  and  the 
noisy-or  model. 
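
As a small worked illustration of why the noisy-or estimate saturates, consider
a single extraction rule with a hypothetical parameter h = 0.5 (this value is
an assumption chosen for the arithmetic, not a fitted parameter). The noisy-or
probability for a label extracted k times is then

$$P_{\text{noisy-or}}(k) = 1 - (1 - h)^{k}, \qquad
P(1) = 0.5, \quad P(4) = 1 - 0.5^{4} = 0.9375, \quad
P(20) = 1 - 0.5^{20} \approx 0.999999.$$

The value depends only on k, not on the sample size or the size of the target
set, so for relations where the average k is around 20 (MayorOf and Country in
Table 1) nearly every label is pushed toward probability 1.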

Figure 2: Deviation of average log likelihood from the ideal
for four relations (City, Film, Country, MayorOf; lower is
better). On average, URNS outperforms noisy-or by a factor
of 15, and PMI by a factor of 20.

             City     Film     MayorOf   Country
Speedup      17.3x    9.5x     1.9x      3.1x
Average k    3.7      4.0      20.7      23.3

Table 1: Improved Efficiency Due to URNS. The top row
reports the number of search engine queries made by
KnowItAll using PMI divided by the number of queries
for KnowItAll using URNS. The bottom row shows that
PMI's queries increase with k, the average number of
distinct labels for each relation. Thus, speedup tends to
vary inversely with the average number of times each
label is drawn.

Figure 3: Deviation of average log likelihood from the ideal
as sample size varies for the Country relation (lower is
better). URNS performs close to the ideal given sufficient
sample size, whereas noisy-or becomes less accurate as
sample size increases.

Our computation of log-likelihood contains a numerical
detail that could potentially influence our results. To avoid the
possibility of a likelihood of zero, we restrict the probabilities
generated by URNS and the other methods to lie within the
range (0.00001, 0.99999). Widening this range tended to
improve URNS's performance relative to the other methods, as
this increases the penalty for erroneously assigning extreme
probabilities, a problem more prevalent for PMI and noisy-or
than for URNS. Even if we narrow the range by two digits
of precision, to (0.001, 0.999), URNS still outperforms PMI
by a factor of 15, and noisy-or by a factor of 13. Thus, we are
comfortable that the differences observed are not an artifact
of this design decision.
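
The clamping step described above can be sketched as follows. This is a
minimal illustration rather than the authors' evaluation code: the function
name, the per-extraction averaging, and the toy inputs are assumptions.

```python
import math


def average_log_likelihood(probs, labels, lo=0.00001, hi=0.99999):
    """Average log likelihood of binary labels, with probabilities clamped to [lo, hi].

    probs  -- predicted probability that each extraction is correct
    labels -- True if the extraction is actually correct, else False
    """
    total = 0.0
    for p, is_correct in zip(probs, labels):
        p = min(max(p, lo), hi)  # clamp so that log(0) can never occur
        total += math.log(p) if is_correct else math.log(1.0 - p)
    return total / len(probs)


if __name__ == "__main__":
    # Toy case: two maximally confident predictions that are both wrong.
    probs, labels = [1.0, 0.0], [False, True]
    wide = average_log_likelihood(probs, labels)  # range (0.00001, 0.99999)
    narrow = average_log_likelihood(probs, labels, lo=0.001, hi=0.999)
    print(wide, narrow)
```

A wider range lets a confidently wrong estimate incur a larger penalty, which
is why the choice of range matters more for methods that tend to output
extreme probabilities.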

4.2  Supervised  IE  Experiments 

We compare URNS with three supervised methods. All methods
utilize the same feature set as URNS, namely the
extraction counts $k_m$.

•  noisy-or - Has one parameter per urn, making a set
of M parameters $(h_1, \ldots, h_M)$, and assigns probability
equal to

$$1 - \prod_{m \in M} (1 - h_m)^{k_m}.$$

•  logistic regression - Has M + 1 parameters
$(a, b_1, b_2, \ldots, b_M)$, and assigns probability equal
to

$$\frac{1}{1 + e^{a + \sum_{m \in M} k_m b_m}}.$$

•  SVM  -  Consists  of  an  SVM  classifier  with  a  Gaussian 
kernel.  To  transform  the  output  of  the  classifier  into  a 
probability,  we  use  the  probability  estimation  built-in  to 
LIBSVM  [Chang  and  Lin,  2001],  which  is  based  on  lo¬ 
gistic  regression  of  the  SVM  decision  values. 
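
The noisy-or and logistic regression formulas in the list above can be written
out as a short sketch. The function names and the example parameter values are
hypothetical; the sign convention in the logistic exponent follows the formula
as printed, so a negative b_m makes the probability grow with the count k_m.

```python
import math


def noisy_or_probability(h, k):
    """1 - prod_m (1 - h_m)^(k_m), with one parameter h_m and one count k_m per urn."""
    product = 1.0
    for h_m, k_m in zip(h, k):
        product *= (1.0 - h_m) ** k_m
    return 1.0 - product


def logistic_probability(a, b, k):
    """1 / (1 + exp(a + sum_m k_m * b_m)), with intercept a and one weight b_m per urn."""
    z = a + sum(k_m * b_m for k_m, b_m in zip(k, b))
    return 1.0 / (1.0 + math.exp(z))


if __name__ == "__main__":
    counts = [1, 1]  # a label extracted once by each of two urns
    print(noisy_or_probability([0.5, 0.5], counts))
    print(logistic_probability(0.0, [-1.0, -1.0], counts))
```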

Parameters maximizing the conditional likelihood of the
training data were found for the noisy-or and logistic
regression models using Differential Evolution. In the SVM case,
we performed grid search to find the kernel parameters
giving the best likelihood performance for each training set; this
grid search was required to get acceptable performance from
the SVM on our task.

The  results  of  our  supervised  learning  experiments  are 
shown  in  Table  2.  URNS,  because  it  is  more  expressive,  is 
able  to  outperform  the  noisy-or  and  logistic  regression  mod¬ 
els.  In  terms  of  deviation  from  the  ideal  log  likelihood,  we 
find  that  on  average  URNS  outperforms  the  noisy-or  model 
by  19%,  logistic  regression  by  10%,  but  SVM  by  only  0.4%. 


                      City     Film     Mayor    Country  Average
noisy-or              0.0439   0.1256   0.0857   0.0795   0.0837
logistic regression   0.0466   0.0893   0.0655   0.1020   0.0759
SVM                   0.0444   0.0865   0.0659   0.0769   0.0684
URNS                  0.0418   0.0764   0.0721   0.0823   0.0681

Table 2: Supervised IE experiments. Deviation from the
ideal log likelihood for each method and each relation
(lower is better). The overall performance differences are
small, with URNS 19% closer to the ideal than noisy-or,
on average, and 10% closer than logistic regression. The
overall performance of SVM is close to that of URNS.
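
The percentages quoted for the supervised experiments can be checked against
the Average column of Table 2. The sketch below assumes that "outperforms by
x%" means the relative reduction in deviation from the ideal log likelihood;
that reading is an assumption, although it reproduces the reported figures.

```python
# Average deviation from the ideal log likelihood, copied from Table 2.
AVERAGE_DEVIATION = {
    "noisy-or": 0.0837,
    "logistic regression": 0.0759,
    "SVM": 0.0684,
    "URNS": 0.0681,
}


def relative_improvement(baseline: float, urns: float) -> float:
    """Fraction by which the URNS deviation is smaller than the baseline's deviation."""
    return (baseline - urns) / baseline


if __name__ == "__main__":
    urns = AVERAGE_DEVIATION["URNS"]
    for name, value in AVERAGE_DEVIATION.items():
        if name != "URNS":
            print(f"{name}: {100 * relative_improvement(value, urns):.1f}%")
```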


5  Related  Work 

In  contrast  to  the  bulk  of  previous  IE  work,  our  focus  is  on  un¬ 
supervised  IE  (UIE)  where  URNS  substantially  outperforms 
previous  methods  (Figure  2). 

In  addition  to  the  noisy-or  models  we  compare  against  in 
our  experiments,  the  IE  literature  contains  a  variety  of  heuris¬ 
tics  using  repetition  as  an  indication  of  the  veracity  of  ex¬ 
tracted  information.  For  example,  Riloff  and  Jones  [Riloff 
and  Jones,  1999]  rank  extractions  by  the  number  of  distinct 
patterns  generating  them,  plus  a  factor  for  the  reliability  of 
the  patterns.  Our  work  is  intended  to  formalize  these  heuris¬ 
tic  techniques,  and  unlike  the  noisy-or  models,  we  explic¬ 
itly  model  the  distribution  of  the  target  and  error  sets  (our 
num(C)  and  num(E)),  which  is  shown  to  be  important  for 
good  performance  in  Section  4.1.  The  accuracy  of  the  proba¬ 
bility  estimates  produced  by  the  heuristic  and  noisy-or  meth¬ 
ods  is  rarely  evaluated  explicitly  in  the  IE  literature,  although 
most  systems  make  implicit  use  of  such  estimates.  For  ex¬ 
ample,  bootstrap-learning  systems  start  with  a  set  of  seed 
instances  of  a  given  relation,  which  are  used  to  identify  ex¬ 
traction  patterns  for  the  relation;  these  patterns  are  in  turn 
used  to  extract  further  instances  (e.g.  [Riloff  and  Jones,  1999; 
Lin  et  al.,  2003;  Agichtein  and  Gravano,  2000]).  As  this  pro¬ 
cess  iterates,  random  extraction  errors  result  in  overly  gen¬ 
eral  extraction  patterns,  leading  the  system  to  extract  further 
erroneous  instances.  The  more  accurate  estimates  of  extrac¬ 
tion  probabilities  produced  by  URNS  would  help  prevent  this 
“concept  drift.” 

Skounakis  and  Craven  [Skounakis  and  Craven,  2003]  de¬ 
velop  a  probabilistic  model  for  combining  evidence  from 
multiple  extractions  in  a  supervised  setting.  Their  problem 
formulation  differs  from  ours,  as  they  classify  each  occur¬ 
rence  of  an  extraction,  and  then  use  a  binomial  model  along 
with  the  false  positive  and  true  positive  rates  of  the  classi¬ 
fier  to  obtain  the  probability  that  at  least  one  occurrence  is  a 
true  positive.  Similar  to  the  above  approaches,  they  do  not 
explicitly  account  for  sample  size  n,  nor  do  they  model  the 
distribution  of  target  and  error  extractions. 

Culotta  and  McCallum  [Culotta  and  McCallum,  2004]  pro¬ 
vide  a  model  for  assessing  the  confidence  of  extracted  infor¬ 
mation  using  conditional  random  fields  (CRFs).  Their  work 
focuses  on  assigning  accurate  confidence  values  to  individual 
occurrences  of  an  extracted  field  based  on  textual  features. 
This  is  complementary  to  our  focus  on  combining  confidence 
estimates  from  multiple  occurrences  of  the  same  extraction. 
In  fact,  each  possible  feature  vector  processed  by  the  CRF  in 
[Culotta  and  McCallum,  2004]  can  be  thought  of  as  a  virtual 
urn  m  in  our  URNS.  The  confidence  output  of  Culotta  and 
McCallum’s  model  could  then  be  used  to  provide  the  preci¬ 
sion  pm  for  the  urn. 

Our work is similar in spirit to BLOG, a language for
specifying probability distributions over sets with unknown objects
[Milch et al., 2004]. As in our work, BLOG models treat
observations as draws from a set of balls in an urn. Whereas
BLOG is intended to be a general modeling framework for
probabilistic first-order logic, our work is directed at
modeling redundancy in IE. In contrast to [Milch et al., 2004],
we provide supervised and unsupervised learning methods
for our model and experiments demonstrating their efficacy
in practice.

6  Conclusions  and  Future  Work 

This  paper  introduced  a  combinatorial  URNS  model  to  the 
problem  of  assessing  the  probability  that  an  extraction  is  cor¬ 
rect.  The  paper  described  supervised  and  unsupervised  meth¬ 
ods  for  estimating  the  parameters  of  the  model  from  data,  and 
reported  on  experiments  showing  that  URNS  massively  out¬ 
performs  previous  methods  in  the  unsupervised  case,  and  is 
slightly  better  than  baseline  methods  in  the  supervised  case. 
Of  course,  additional  experiments  and  a  more  comprehensive 
sensitivity  analysis  of  URNS  are  necessary. 

URNS  is  applicable  to  tasks  other  than  IE.  For  example, 
PMI  computed  over  search  engine  hit  counts  has  been  used 
to  determine  synonymy  [Turney,  2001],  and  for  question  an¬ 
swering  [Magnini  et  al.,  2002].  In  the  synonymy  case,  for 
example,  the  PMI  between  two  terms  is  used  as  a  measure  of 
their  synonymy;  applying  URNS  to  the  same  co-occurrence 
statistics  should  result  in  a  more  accurate  probabilistic  assess¬ 
ment  of  whether  two  terms  are  synonyms.  Comparing  URNS 
with  PMI  on  these  tasks  is  a  topic  for  future  work. 